

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Liu, Bo, Guertler, Leon, Yu, Simon, Liu, Zichen, Qi, Penghui, Balcells, Daniel, Liu, Mickel, Tan, Cheston, Shi, Weiyan, Lin, Min, Lee, Wee Sun, Jaques, Natasha

arXiv.org Artificial Intelligence

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems, as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. With SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone yields an 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected-value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance, as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) still yields a 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
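The core idea behind role-conditioned advantage estimation can be illustrated with a minimal sketch: keep a separate return baseline per player role so that each role's advantage is measured against opponents of its own kind. All names here are illustrative assumptions, not the paper's actual code; the paper's RAE may differ in how the baseline is estimated.

```python
from collections import defaultdict

class RoleConditionedAdvantage:
    """Toy sketch of a role-conditioned baseline: one exponential moving
    average of returns per role, with advantage = return - role baseline."""

    def __init__(self, decay=0.95):
        self.decay = decay
        self.baseline = defaultdict(float)  # role -> EMA of returns

    def update(self, role, ret):
        # Update this role's baseline, then return the centered advantage.
        b = self.baseline[role]
        self.baseline[role] = self.decay * b + (1 - self.decay) * ret
        return ret - self.baseline[role]

rae = RoleConditionedAdvantage()
adv = rae.update(role=0, ret=1.0)  # first player won this zero-sum game
```

The point of conditioning on role is that in asymmetric games (e.g. first vs. second player in Kuhn Poker) the two seats have different expected returns, so a shared baseline would systematically bias one side's gradient.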


Search-contempt: a hybrid MCTS algorithm for training AlphaZero-like engines with better computational efficiency

Joshi, Ameya

arXiv.org Artificial Intelligence

AlphaZero in 2017 was able to master chess and other games without human knowledge by playing millions of games against itself (self-play), with a computation budget running into the tens of millions of dollars. It used a variant of the Monte Carlo Tree Search (MCTS) algorithm known as PUCT. This paper introduces search-contempt, a novel hybrid variant of the MCTS algorithm that fundamentally alters the distribution of positions generated in self-play, preferring more challenging positions. In addition, search-contempt has been shown to give a big boost in strength for engines in Odds Chess (where one side receives an unfavorable position from the start). More significantly, it opens up the possibility of training a self-play-based engine far more computationally efficiently, with the number of training games running into the hundreds of thousands and costing tens of thousands of dollars, instead of the tens of millions of training games costing millions of dollars that AlphaZero required. This means it may finally be possible to train such a program from zero on a standard consumer GPU, even with a very limited compute, cost, or time budget.
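For context, the PUCT rule that search-contempt modifies selects the child maximizing Q(s,a) + c · P(s,a) · √N(s) / (1 + N(s,a)). A minimal sketch of that base selection rule, with illustrative data structures (the abstract does not give search-contempt's exact formula, so only standard PUCT is shown):

```python
import math

def puct_select(children, c_puct=1.5):
    """Return the index of the child maximizing the PUCT score.
    Each child is a dict with prior P, visit count N, and total value W."""
    n_parent = sum(ch["N"] for ch in children)
    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] > 0 else 0.0       # mean value
        u = c_puct * ch["P"] * math.sqrt(n_parent + 1) / (1 + ch["N"])
        return q + u                                         # exploit + explore
    return max(range(len(children)), key=lambda i: score(children[i]))
```

An unvisited child with a reasonable prior gets a large exploration bonus, which is what drives MCTS to cover the move tree before committing.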


Read to Play (R2-Play): Decision Transformer with Multimodal Game Instruction

Jin, Yonggang, Zhang, Ge, Zhao, Hao, Zheng, Tianyu, Guo, Jiawei, Xiang, Liuyu, Yue, Shawn, Huang, Stephen W., Chen, Wenhu, He, Zhaofeng, Fu, Jie

arXiv.org Artificial Intelligence

Developing a generalist agent is a longstanding objective in artificial intelligence. Previous efforts that utilize extensive offline datasets from various tasks demonstrate remarkable multitasking performance in reinforcement learning. However, these works struggle to extend their capabilities to new tasks. Recent approaches integrate textual guidance or visual trajectories into decision networks to provide task-specific contextual cues, a promising direction. However, relying solely on textual guidance or visual trajectories is insufficient to accurately convey the contextual information of a task. This paper explores enhanced forms of task guidance for agents, enabling them to comprehend gameplay instructions and thereby facilitating a "read-to-play" capability. Drawing inspiration from the success of multimodal instruction tuning in visual tasks, we treat the visual-based RL task as a long-horizon vision task and construct a set of multimodal game instructions to incorporate instruction tuning into a decision transformer. Experimental results demonstrate that incorporating multimodal game instructions significantly enhances the decision transformer's multitasking and generalization capabilities.
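A decision transformer consumes an interleaved sequence of (return-to-go, state, action) tokens; instruction conditioning amounts to prepending an instruction prefix to that sequence. The sketch below is a hypothetical illustration of this input layout, not the paper's implementation:

```python
def build_sequence(instruction_tokens, rtgs, states, actions):
    """Prepend a (possibly multimodal) instruction prefix, then interleave
    return-to-go, state, and action tokens, as in a decision transformer.
    Tokens are modeled as (kind, value) tuples for illustration."""
    seq = list(instruction_tokens)                 # instruction prefix
    for r, s, a in zip(rtgs, states, actions):
        seq += [("rtg", r), ("state", s), ("action", a)]
    return seq

seq = build_sequence([("instr", "collect coins")], [5.0], ["frame0"], ["jump"])
```

At inference time the model would attend over the instruction prefix when predicting the next action token, which is how task context reaches the policy.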


Enabling A Network AI Gym for Autonomous Cyber Agents

Li, Li, Rami, Jean-Pierre S. El, Taylor, Adrian, Rao, James Hailing, Kunz, Thomas

arXiv.org Artificial Intelligence

This work aims to enable autonomous agents for network cyber operations (CyOps) by applying reinforcement and deep reinforcement learning (RL/DRL). The required RL training environment is particularly challenging, as it must balance the need for high fidelity, best achieved through real network emulation, with the need to run large numbers of training episodes, best achieved using simulation. A unified training environment, the Cyber Gym for Intelligent Learning (CyGIL), is developed in which an emulated CyGIL-E automatically generates a simulated CyGIL-S. Preliminary experimental results show that CyGIL-S can train agents in minutes, compared with the days required in CyGIL-E. The agents trained in CyGIL-S transfer directly to CyGIL-E, showing full decision proficiency in the emulated "real" network. By enabling offline RL, the CyGIL solution presents a promising sim-to-real direction for leveraging RL agents in real-world cyber networks.


Real or Fake Text? We Can Learn to Spot the Difference

#artificialintelligence

The most recent generation of chatbots has surfaced longstanding concerns about the growing sophistication and accessibility of artificial intelligence. Fears about the integrity of the job market -- from the creative economy to the managerial class -- have spread to the classroom as educators rethink learning in the wake of ChatGPT. Yet while apprehensions about employment and schools dominate headlines, the truth is that the effects of large-scale language models such as ChatGPT will touch virtually every corner of our lives. These new tools raise society-wide concerns about artificial intelligence's role in reinforcing social biases, committing fraud and identity theft, generating fake news, spreading misinformation and more. A team of researchers at the University of Pennsylvania School of Engineering and Applied Science is seeking to empower tech users to mitigate these risks.


Bootstrapped Q-learning with Context Relevant Observation Pruning to Generalize in Text-based Games

Chaudhury, Subhajit, Kimura, Daiki, Talamadupula, Kartik, Tatsubori, Michiaki, Munawar, Asim, Tachibana, Ryuki

arXiv.org Machine Learning

We show that Reinforcement Learning (RL) methods for solving Text-Based Games (TBGs) often fail to generalize to unseen games, especially in small-data regimes. To address this issue, we propose Context Relevant Episodic State Truncation (CREST), which removes irrelevant tokens from observation text for improved generalization. Our method first trains a base model using Q-learning, which typically overfits the training games. The base model's action token distribution is then used to prune observations, removing irrelevant tokens. A second, bootstrapped model is retrained on the pruned observation text. Our bootstrapped agent shows improved generalization in solving unseen TextWorld games, using 10x-20x fewer training games than previous state-of-the-art methods while also requiring fewer training episodes.
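The pruning step can be sketched as a simple filter: keep only the observation tokens that score highly under the base model's action-token statistics. The scoring dictionary below is a hypothetical stand-in for however CREST actually derives token relevance from the Q-learning policy:

```python
def prune_observation(tokens, token_relevance, threshold=0.1):
    """Drop observation tokens whose relevance score (a stand-in for the
    base model's action-token distribution) falls below a threshold."""
    return [t for t in tokens if token_relevance.get(t, 0.0) >= threshold]

relevance = {"north": 0.3, "door": 0.2, "go": 0.05}  # illustrative scores
pruned = prune_observation(["go", "north", "the", "ornate", "door"], relevance)
```

The bootstrapped agent then trains on `pruned` rather than the raw observation, so spurious tokens cannot be memorized.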


A Tic Tac Toe AI with Neural Networks and Machine Learning

#artificialintelligence

This article is my entry for CodeProject's AI competition "Image Classification Challenge". My goal was to teach a neural network to play a game of tic tac toe, starting from only knowing the rules. Tic tac toe is a solved game: a perfect strategy exists, so a neural network is a bit overkill and will not perform as well as existing programs and humans can. Described at a high level: when the AI needs to make a move, it iterates over all possible moves, generates the board that results from each move, and uses the neural network to evaluate how good the position is after that move.
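That move-selection loop is straightforward to write down. In this sketch the board is a 9-character string and `evaluate` is a stand-in for the article's neural network (any function scoring a board from the mover's perspective):

```python
def best_move(board, player, evaluate):
    """Try every legal move, score the resulting board with `evaluate`
    (a stand-in for the neural network), and return the best cell index.
    `board` is a 9-char string of 'X', 'O', or ' '."""
    best, best_score = None, float("-inf")
    for i, cell in enumerate(board):
        if cell == " ":                                   # legal move
            candidate = board[:i] + player + board[i + 1:]
            score = evaluate(candidate, player)
            if score > best_score:
                best, best_score = i, score
    return best
```

Swapping `evaluate` between a trained network and a hand-written heuristic changes nothing else in the loop, which is the appeal of this one-ply search design.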


Race for the Galaxy AI

#artificialintelligence

What makes a game replayable over time? It offers new challenges over and over again. One way to do that is to include an AI opponent so skilled that even advanced players will continue to be challenged after hundreds of hours of play. Race has been one of the top-selling board games this year, partly because of the neural network that powers its AI. Race for the Galaxy uses a temporal difference neural network.
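Temporal difference learning updates a position's value toward the value of the position that follows it. A tabular TD(0) sketch illustrates the update the game's network approximates (the dictionary `V` is an illustrative stand-in for the value network):

```python
def td_update(V, s, s_next, reward, alpha=0.1, gamma=1.0):
    """One TD(0) backup: nudge V(s) toward reward + gamma * V(s_next).
    V is a dict mapping states to estimated values."""
    target = reward + gamma * V.get(s_next, 0.0)
    v = V.get(s, 0.0)
    V[s] = v + alpha * (target - v)
    return V[s]
```

Played over many self-play games, these backups propagate end-of-game outcomes to earlier positions, which is how a TD-trained evaluator learns without hand-coded strategy.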


Brain training games may be a waste of time, scientists say

Daily Mail - Science & tech

Trendy brain training computer games may be a waste of time and money, scientists have revealed. They said that while people may get better at the exercises they practise, there is little or no evidence this helps them in their day-to-day lives. Numerous companies sell packages of games, puzzles and exercises designed to improve memory, boost attention span or simply keep the mind sharp into old age. Researchers examined more than 130 studies into brain training.


Pairwise Relative Offset Features for Atari 2600 Games

Talvitie, Erik (Franklin and Marshall College) | Bowling, Michael (University of Alberta)

AAAI Conferences

We introduce a novel feature set for reinforcement learning in visual domains (e.g. video games) designed to capture pairwise, position-invariant, spatial relationships between objects on the screen. The feature set is simple to implement and computationally practical, but nevertheless allows for substantial improvement over existing baselines in a wide variety of Atari 2600 games. In the most dramatic results the features allow multiple orders of magnitude improvement in performance.
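Pairwise relative offset features can be sketched directly from the description: for each ordered pair of on-screen objects, emit a binary feature keyed by the two object classes and their (dx, dy) offset, so the feature fires regardless of where on the screen the pair appears. Names and the tuple layout are illustrative assumptions:

```python
def relative_offset_features(objects):
    """Position-invariant pairwise features for detected screen objects.
    `objects` is a list of (class_name, x, y) tuples; the result is the set
    of active (class_i, class_j, dx, dy) features."""
    feats = set()
    for (ci, xi, yi) in objects:
        for (cj, xj, yj) in objects:
            if (ci, xi, yi) != (cj, xj, yj):
                feats.add((ci, cj, xj - xi, yj - yi))  # offset from i to j
    return feats

feats = relative_offset_features([("ship", 2, 3), ("alien", 5, 3)])
```

Because only the offset is recorded, translating the whole scene leaves the active feature set unchanged, which is the position invariance the abstract refers to. (In practice offsets would also be discretized to keep the feature space manageable.)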